Abstract

Counterfactual Regret Minimization (CFR) is an efficient no-regret learning al- 
gorithm for decision problems modeled as extensive games. CFR's regret bounds 
depend on the requirement of perfect recall: players always remember information 
that was revealed to them and the order in which it was revealed. In games without 
perfect recall, however, CFR's guarantees do not apply. In this paper, we present 
the first regret bound for CFR when applied to a general class of games with im- 
perfect recall. In addition, we show that CFR applied to any abstraction belonging 
to our general class results in a regret bound not just for the abstract game, but for 
the full game as well. We verify our theory and show how imperfect recall can 
be used to trade a small increase in regret for a significant reduction in memory in 
three domains: die-roll poker, phantom tic-tac-toe, and Bluff. 



1 Introduction 

Many real-world problems can be modeled as a repeated decision-making task. For 
problems involving multiple agents, one can model the repeated task as a normal-form 
game. When the task incorporates sequential decisions involving imperfect informa- 
tion or stochastic events, an extensive game is a useful alternative. In such decision 
problems, a typical goal is to minimize regret: the amount of utility lost by playing a 
past sequence of strategies, versus playing the best, stationary strategy in hindsight. 

In this paper, we consider the problem of minimizing regret in an extensive game.
A common approach to achieving low regret in extensive games is the Counterfactual
Regret Minimization (CFR) [Zinkevich et al., 2008] algorithm. CFR uses a regret
minimizer at every decision point with an alternative notion of regret, which provably
minimizes regret in the entire extensive game. However, convergence is limited to
games exhibiting perfect recall: players never forget information that was revealed
to them, nor the order in which the information was revealed. For games with
imperfect recall, CFR's original analysis provides no general guarantees.

Imperfect recall brings about a number of complications. In games with perfect
recall, every mixed strategy (probability distribution over pure strategies) has a
utility-equivalent behavioral strategy (probability distribution over actions at each
decision point) [Kuhn, 1953]. While certain lossless imperfect recall games share this
property [Kaneko and Kline, 1995], it is not true for imperfect recall games in
general [Piccione and Rubinstein, 1996]. In addition, the decision problem of determining
if a player can assure themself a certain payoff in an imperfect recall game is
NP-complete [Koller and Megiddo, 1992]. Two-player zero-sum games can be solved
by constructing an appropriate linear program [Koller et al., 1994] or minimizing
regret [Zinkevich et al., 2008], provided the game has perfect recall. Without perfect
recall, however, the problem becomes exponential in the worst case [Koller et al., 1994].

On the other hand, imperfect recall extensive games are more versatile than perfect 
recall games for modelling large real-world problems. While perfect recall requires all 
past information to be remembered, imperfect recall allows irrelevant information to be 
forgotten so that the size of the game is smaller. As CFR's memory requirements are 
linear in the size of the game, more games become feasible through imperfect recall. 
Despite the complications above, CFR has empirically been shown to work well when
applied to imperfect recall abstractions of Texas Hold'em poker [Waugh et al., 2009b],
but there is currently no theory to suggest why this is so.

This paper presents theoretical groundings for applying CFR to games exhibiting 
imperfect recall. We define a general class of imperfect recall games and provide a 
bound on CFR's regret in such games. For a subset of this class, CFR minimizes aver- 
age regret in the extensive game. Moreover, our results also provide regret guarantees 
when applying CFR to an abstract game, provided the abstract game belongs to our 
general class. We test our theory in three different domains: die-roll poker, phantom 
tic-tac-toe, and Bluff. To the best of our knowledge, this work demonstrates the first 
theoretically-grounded, practical use of imperfect recall in extensive games. 



2 Background 



An extensive-form game $\Gamma$ with imperfect information [Osborne and Rubinstein, 1994]
is a tuple $(N, A, H, Z, P, \sigma_c, u, \mathcal{I})$, where $N$ is a finite set of players
and $A$ is a finite set of actions. $H$ is a finite set of histories: a subset of the set of
sequences of elements in $A$. A prefix of a history $h' \in H$ is a history $h \in H$ where
$h'$ begins with the sequence $h$; we denote prefix histories by $h \sqsubseteq h'$. For every
$h \in H$, define $A(h) = \{a : a \in A, ha \in H\}$, the set of valid actions at history $h$;
$P(h) \in N \cup \{c\}$ is the player to act at the history $h$, or chance if $P(h) = c$; and
$H_i = \{h \mid h \in H, P(h) = i\}$. $Z \subseteq H$ is the set of terminal histories. A
terminal history $z \in Z$ is a history where there does not exist any history $h \in H$,
$h \neq z$, such that $z \sqsubseteq h$. The utility function $u_i : Z \to \mathbb{R}$ gives
the utility to player $i \in N$ for each terminal history. If $|N| = 2$ and for all
$z \in Z$, $\sum_{i \in N} u_i(z) = 0$, we say the game is zero-sum.

For each player $i \in N$, $\mathcal{I}_i$ is a partition of $H_i$ with the property that
$A(h) = A(h')$ whenever $h$ and $h'$ are in the same member of the partition. We call
$\mathcal{I}_i$ the information partition of player $i$, and a set $I \in \mathcal{I}_i$ is an
information set of player $i$. A player, when taking actions, cannot distinguish between two
histories in the same information set. For $I \in \mathcal{I}_i$ we denote $A(I)$ as the set
$A(h)$ for any $h \in I$. Define $I(h)$ to be the information set containing $h$. In this
paper, we restrict ourselves to games where players cannot reach the same information set
twice in a single game. Thus, we assume that for all $i \in N$ and $h, h' \in H_i$,

\[ h \sqsubseteq h', \; h \neq h' \;\Rightarrow\; I(h) \neq I(h'). \tag{1} \]

Finally, $\sigma_c$ is the fixed "strategy" of the special player chance. $\sigma_c(h, a)$ gives
the probability that chance event $a$ occurs at $h$. For all $h \in H_c$,
$\sum_{a \in A(h)} \sigma_c(h, a) = 1$ and the decisions at any $h$ are independent of the
decision at any other $h' \neq h$.

Given a history $h$, define $X_i(h)$ to be the sequence of information set, action pairs
such that $(I, a) \in X_i(h)$ if $I \in \mathcal{I}_i$ and there exists $h' \sqsubseteq h$ such
that $h' \in I$ and $h'a \sqsubseteq h$. The order of the pairs in $X_i(h)$ is the order in which
they occur in $h$. Define $X(h)$ to be the sequence of information set, action pairs belonging
to all players in the order in which they occur in $h$, and $X_{-i}(h)$ similarly, by removing
player $i$'s information set, action pairs from $X(h)$. Also, define $X(h, h')$ to be the
sequence of information set, action pairs belonging to all players that start at $h$ and end at
$h'$ when $h \sqsubseteq h'$; if $h \not\sqsubseteq h'$, $X(h, h')$ is defined to be the empty
sequence. Finally, $X_i(h, h')$ and $X_{-i}(h, h')$ are similarly defined.

Definition 1. An extensive game has perfect recall if for every player $i \in N$, for every
information set $I \in \mathcal{I}_i$, for any $h, h' \in I$: $X_i(h) = X_i(h')$. Otherwise, the
game has imperfect recall.

Intuitively, with perfect recall every player has an infallible memory: they cannot
"forget" anything during a play of the game that they once knew. Hence, what a player
knows at $I$ is a composition of what the player has discovered in the past up to this point
and the precise order in which information was discovered. Note that every perfect
recall game satisfies equation (1), but not every imperfect recall game does.
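
Definition 1 can be checked mechanically once the sequences $X_i(h)$ are available. The following Python sketch is only an illustration: the dictionary encoding, the function name `has_perfect_recall`, and the two toy games are assumptions made here, not constructs from the paper.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

# One (information set, action) pair as it appears in X_i(h).
Pair = Tuple[Hashable, Hashable]


def has_perfect_recall(info_sets: Dict[Hashable, List[Sequence[Pair]]]) -> bool:
    """Return True if X_i(h) is identical for all h in every information set.

    info_sets maps an information-set label to the list of X_i(h) sequences,
    one per history h in that set (a hypothetical input format).
    """
    for sequences in info_sets.values():
        distinct = {tuple(seq) for seq in sequences}
        if len(distinct) > 1:
            return False
    return True


# Toy player who first acts at J, then acts again later.
# Perfect recall: the later information sets record the action taken at J.
REMEMBERS = {
    "K_left": [[("J", "left")]],
    "K_right": [[("J", "right")]],
}
# Imperfect recall: both histories are merged into one information set K.
FORGETS = {
    "K": [[("J", "left")], [("J", "right")]],
}
```

Under these assumptions, `has_perfect_recall(REMEMBERS)` holds while `has_perfect_recall(FORGETS)` does not, because the two histories in `K` carry different past pairs.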

A (behavioral) strategy $\sigma_i$ for player $i$ is a function such that for each history
$h \in H_i$, $\sigma_i(h)$ is a probability distribution over $A(h)$. Furthermore, it is
required that $\sigma_i(h) = \sigma_i(h')$ for all $h, h' \in I$, and we denote that as
$\sigma_i(I)$. The set of all such strategies for player $i$ is denoted by $\Sigma_i$. A
strategy profile $\sigma \in \Sigma$ is a collection of strategies, one for each player, i.e.,
in a two-player game $\sigma = (\sigma_1, \sigma_2)$. By notational convention, $\sigma_{-i}$
refers to the set of strategies including every strategy in $\sigma$ except player $i$'s
strategy.

For any $\sigma \in \Sigma$, $i \in N \cup \{c\}$, and $h \in H$, define
$\pi_i^\sigma(h) = \prod_{h'a \sqsubseteq h,\, P(h') = i} \sigma_i(h', a)$
to be the probability that player $i$ plays to reach history $h$ under $\sigma$. We can then
define $\pi^\sigma(h) = \prod_{i \in N \cup \{c\}} \pi_i^\sigma(h)$ to be the probability that
history $h$ is reached under $\sigma$. Let $\pi_{-i}^\sigma(h)$ be the product of all players'
contribution (including chance) except that of player $i$. Furthermore, let
$\pi_i^\sigma(h, h')$ be the probability of player $i$ playing to reach history $h'$ after $h$,
given $h$ has occurred. Let $\pi^\sigma(h, h')$ and $\pi_{-i}^\sigma(h, h')$ be defined
similarly. Finally, we can define the expected utility of a strategy profile $\sigma$ for
player $i$ to be

\[ u_i(\sigma) = \mathbb{E}_{z \in Z}[u_i(z)] = \sum_{z \in Z} u_i(z) \, \pi^\sigma(z). \]
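
As a rough illustration of these definitions, the Python sketch below computes the reach probabilities and the expected utility on a two-step toy game. The dictionary-based encoding, the names `reach_probability` and `expected_utility`, and all numbers are assumptions made for this example.

```python
from typing import Dict, Optional, Tuple

History = Tuple[str, ...]

# Hypothetical toy game: chance flips a coin, then player 1 bets or passes.
PLAYER_AT: Dict[History, str] = {(): "c", ("heads",): "1", ("tails",): "1"}

# SIGMA[(h, a)] is the probability of taking action a at history h.
SIGMA: Dict[Tuple[History, str], float] = {
    ((), "heads"): 0.5, ((), "tails"): 0.5,
    (("heads",), "bet"): 0.75, (("heads",), "pass"): 0.25,
    (("tails",), "bet"): 0.25, (("tails",), "pass"): 0.75,
}

# Utilities for player 1 at terminal histories (made-up numbers).
UTILITY: Dict[History, float] = {
    ("heads", "bet"): 2.0, ("heads", "pass"): 0.0,
    ("tails", "bet"): -2.0, ("tails", "pass"): 0.0,
}


def reach_probability(h: History, player: Optional[str] = None) -> float:
    """Product of sigma(h', a) over all prefixes h'a of h.

    With player=None every decision is multiplied (probability of reaching h);
    otherwise only the decisions made by `player` contribute.
    """
    prob = 1.0
    for n, action in enumerate(h):
        prefix = h[:n]
        if player is None or PLAYER_AT[prefix] == player:
            prob *= SIGMA[(prefix, action)]
    return prob


def expected_utility() -> float:
    """Sum over terminal histories z of u(z) times the probability of reaching z."""
    return sum(u * reach_probability(z) for z, u in UTILITY.items())
```

Here `reach_probability(z)` factors into the chance contribution and player 1's contribution, mirroring the product form of the reach probability in the text.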

We will say that a game $\Gamma' = (N, A', H, Z, P, \sigma_c, u, \mathcal{I}')$ is an
abstraction, or an abstract game, of $\Gamma = (N, A, H, Z, P, \sigma_c, u, \mathcal{I})$ if for
all $i \in N$ and $h, k \in H_i$, $A'(h) \subseteq A(h)$ and $I(h) = I(k)$ implies
$I'(h) = I'(k)$. In this paper, we only consider
abstractions where A = A'. A typical use of abstraction is to reduce the size of the 
game by ensuring that < 

3 Example: Die-Roll Poker



We now introduce a game that we will use as a running example throughout the paper. 

Die-roll poker (DRP) is a simplified two-player poker game that uses dice rather 
than cards. To begin, each player antes one chip to the pot. There are two betting 
rounds, where at the beginning of each round, players roll a private six-sided die. The 
game has imperfect information due to the players not seeing the result of the oppo- 
nent's die rolls. During a betting round, a player may fold (forfeit the game), call 
(match the current bet), or raise (increase the current bet) by a fixed number of chips, 
with a maximum of two raises per round. In the first round, raises are worth two chips, 
whereas in the second round, raises are worth four chips. If both players have not 
folded by the end of the second round, a showdown occurs where the player with the 
largest sum of their two dice wins all of the chips in the pot. 

DRP is naturally a game with perfect recall; players remember the exact sequence 
of bets made and the exact outcome of each die roll from both rounds. However, 
consider an imperfect recall version of DRP, DRP-IR, where at the beginning of the 
second round, both players "forget" their first die roll and only know the sum of their 
two dice. In other words, DRP-IR is an abstraction of DRP where any two histories 
are in the same abstract information set if and only if the sum of the player's private 
dice is the same and the sequence of betting is the same. DRP-IR has imperfect recall 
since histories that were distinguishable in the first round (for example, a roll of 1 and 
a roll of 4) are no longer distinguishable in the second round (for example, a roll of 1 
followed by a roll of 5, and a roll of 4 followed by a roll of 2). 
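
The difference between DRP and DRP-IR can be sketched as two ways of keying a player's information set. The Python below is a minimal illustration; the function names `drp_key` and `drp_ir_key` and the string encoding of the betting sequence are assumptions for this example, not the authors' implementation.

```python
from typing import Tuple


def drp_key(rolls: Tuple[int, ...], betting: str) -> Tuple:
    """Perfect recall: the player remembers every private roll in order."""
    return (rolls, betting)


def drp_ir_key(rolls: Tuple[int, ...], betting: str) -> Tuple:
    """DRP-IR: in round two only the sum of the two dice is remembered."""
    if len(rolls) == 2:
        return (sum(rolls), betting)
    return (rolls, betting)


# Number of distinct second-round keys for one fixed betting sequence.
ALL_ROLL_PAIRS = [(a, b) for a in range(1, 7) for b in range(1, 7)]
DRP_KEYS = {drp_key(r, "cc") for r in ALL_ROLL_PAIRS}
DRP_IR_KEYS = {drp_ir_key(r, "cc") for r in ALL_ROLL_PAIRS}
```

With this encoding, the rolls (1, 5) and (4, 2) map to different keys in DRP but to the same key in DRP-IR, and each second-round betting sequence has 11 abstract keys (one per dice sum) instead of 36.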

4 Counterfactual Regret Minimization 

Given a sequence of strategy profiles $\sigma^1, \sigma^2, \ldots, \sigma^T$, the (external)
regret for player $i$,

\[ R_i^T = \max_{\sigma_i' \in \Sigma_i} \sum_{t=1}^{T} \left( u_i(\sigma_i', \sigma_{-i}^t) - u_i(\sigma_i^t, \sigma_{-i}^t) \right), \]

is the amount of utility player $i$ could have gained had she played the best single
strategy in hindsight for all time steps $t \in \{1, 2, \ldots, T\}$. An algorithm minimizes
regret, or is a no-regret algorithm, for player $i$ if the average positive regret approaches
zero; i.e., $\lim_{T \to \infty} R_i^{T,+}/T = 0$, where $x^+ = \max\{x, 0\}$. Having no regret
is a desirable property. For example, it is well known that in a zero-sum game, if both
players' average regret is bounded above by $\epsilon$, then the average of the strategy
profiles generated is a $2\epsilon$-Nash equilibrium.
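
To make the definition of external regret concrete, the sketch below evaluates it for the row player of a normal-form (matrix) game, a deliberate simplification of the extensive-form setting. Because expected utility is linear in the player's own strategy, the best stationary strategy in hindsight can be taken to be a pure action. The function name `external_regret` and the matching-pennies payoffs are assumptions for illustration.

```python
from typing import List, Sequence


def external_regret(
    payoff: Sequence[Sequence[float]],
    own: Sequence[Sequence[float]],
    opp: Sequence[Sequence[float]],
) -> float:
    """External regret of the row player after T = len(own) steps.

    payoff[a][b] is the row player's utility when she plays a and the
    opponent plays b; own[t] and opp[t] are the mixed strategies at step t.
    """
    n_actions = len(payoff)
    totals: List[float] = [0.0] * n_actions  # utility of always playing a
    realized = 0.0                           # utility actually obtained
    for sigma_i, sigma_opp in zip(own, opp):
        values = [
            sum(payoff[a][b] * p for b, p in enumerate(sigma_opp))
            for a in range(n_actions)
        ]
        for a in range(n_actions):
            totals[a] += values[a]
        realized += sum(sigma_i[a] * values[a] for a in range(n_actions))
    return max(totals) - realized


# Matching pennies from the row player's point of view.
PENNIES = [[1.0, -1.0], [-1.0, 1.0]]
```

For instance, always playing the first action against an opponent who always plays the second one loses 1 per step while the other action would win 1 per step, so the regret grows linearly in the number of steps.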

Counterfactual Regret Minimization (CFR) is an iterative no-regret learning algorithm
for extensive-form games having perfect recall. On each iteration $t$, CFR
recursively traverses the entire game tree, computing the expected utility for player $i$
at each information set $I \in \mathcal{I}_i$ under the current profile $\sigma^t$, assuming
player $i$ plays to reach $I$. This expectation is the counterfactual value for player $i$,

\[ v_i(\sigma, I) = \sum_{z \in Z_I} u_i(z) \, \pi_{-i}^{\sigma}(z[I]) \, \pi^{\sigma}(z[I], z), \]

where $Z_I$ is the set of terminal histories passing through $I$ and $z[I]$ is the prefix of $z$
contained in $I$ ($z[I]$ is unique by equation (1)). For each action $a \in A(I)$, these values
determine the counterfactual regret at iteration $t$,
$r_i^t(I, a) = v_i(\sigma^t_{(I \to a)}, I) - v_i(\sigma^t, I)$,
where $\sigma_{(I \to a)}$ is the profile $\sigma$ except at $I$, action $a$ is always taken.
The regret $r_i^t(I, a)$ measures how much player $i$ would rather play action $a$ at $I$ than
play $\sigma^t$. Finally, $\sigma^t$ is updated by applying regret matching
[Hart and Mas-Colell, 2000; Zinkevich et al., 2008] to the immediate counterfactual regrets,
$R_i^T(I, a) = \sum_{t=1}^{T} r_i^t(I, a)$, according to

\[ \sigma_i^{T+1}(I, a) = \frac{R_i^{T,+}(I, a)}{\sum_{b \in A(I)} R_i^{T,+}(I, b)}, \]

with actions chosen uniformly at random when the denominator is zero. Regret matching
is a no-regret learner that minimizes the per-information set immediate counterfactual
regret,

\[ \max_{a \in A(I)} \frac{R_i^{T,+}(I, a)}{T} \le \frac{\Delta_i \sqrt{|A(I)|}}{\sqrt{T}}, \tag{2} \]

where $\Delta_i = \max_{z, z' \in Z} u_i(z) - u_i(z')$. In games having perfect recall,
minimizing the immediate counterfactual regrets at every information set in turn minimizes
average regret, $R_i^T / T$. This is because perfect recall implies that the regret is bounded
by the sum of the positive parts of the immediate counterfactual regrets
[Zinkevich et al., 2008],

\[ R_i^T \le \sum_{I \in \mathcal{I}_i} \max_{a \in A(I)} R_i^{T,+}(I, a), \tag{3} \]

and thus

\[ \frac{R_i^T}{T} \le \frac{\Delta_i \, |\mathcal{I}_i| \sqrt{|A_i|}}{\sqrt{T}}, \tag{4} \]

where $|A_i| = \max_{I \in \mathcal{I}_i} |A(I)|$. CFR must store the immediate counterfactual
regret for each information set, action pair, and thus CFR's memory requirements are
$O(|\mathcal{I}_i| |A_i|)$.

While equation (2) still holds in imperfect recall games, equation (3) and consequently
equation (4) are not guaranteed to hold. An example game where CFR would
exhibit high regret is provided in Section 7. Consequently, the regret for playing
according to the CFR algorithm is unknown in general for imperfect recall games. However,
the advantage of applying CFR to DRP-IR, for example, is that this imperfect
recall game contains fewer information sets than the full game, and thus less memory
is required by CFR. Although DRP is a toy example and is small enough to run CFR
on the full game, this example is useful for understanding the concepts in the rest of
this paper.
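
The per-information-set update described in this section (accumulate counterfactual regrets, then apply regret matching) can be sketched in a few lines of Python. This is a minimal illustration for a single information set; the function names are hypothetical, and the counterfactual action values are assumed to be supplied by a game-tree traversal that is not shown.

```python
from typing import List, Sequence


def regret_matching(cumulative_regret: Sequence[float]) -> List[float]:
    """Next strategy at one information set from cumulative regrets.

    Each action is played in proportion to its positive cumulative regret;
    if no action has positive regret, the uniform strategy is returned.
    """
    positive = [max(r, 0.0) for r in cumulative_regret]
    total = sum(positive)
    if total > 0.0:
        return [p / total for p in positive]
    return [1.0 / len(positive)] * len(positive)


def update_regrets(
    cumulative_regret: Sequence[float],
    action_values: Sequence[float],
    strategy: Sequence[float],
) -> List[float]:
    """Add this iteration's counterfactual regrets to the running totals.

    action_values[a] is the counterfactual value of always taking a here;
    the value of the current strategy is their strategy-weighted average.
    """
    node_value = sum(p * v for p, v in zip(strategy, action_values))
    return [r + (v - node_value) for r, v in zip(cumulative_regret, action_values)]
```

Only one number per information set, action pair is stored between iterations, which is the source of the memory requirement stated above.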



5 CFR with Imperfect Recall 

In this section, we investigate the application of CFR to games with imperfect recall. 
We begin by showing that CFR minimizes regret for a class of games that we call "well- 
formed games." We then present a bound on the average regret for a more general class 
of imperfect recall games that we call "skew well-formed games." 
5.1 Well-formed Games

For games $\Gamma = (N, A, H, Z, P, \sigma_c, u, \mathcal{I})$ and
$\breve{\Gamma} = (N, A, H, Z, P, \sigma_c, u, \breve{\mathcal{I}})$, we say
that $\breve{\Gamma}$ is a perfect recall refinement of $\Gamma$ if $\breve{\Gamma}$ has perfect
recall and $\Gamma$ is an abstraction of $\breve{\Gamma}$. So, the information available to
players in $\breve{\Gamma}$ is never forgotten, and is at least as informative as the
information available to them in $\Gamma$. For example, DRP is a perfect
recall refinement of DRP-IR. Every game has at least one perfect recall refinement by
simply making $\breve{\Gamma}$ a perfect information game ($\breve{I} = \{h\}$ for all
$\breve{I} \in \breve{\mathcal{I}}_i$). Furthermore, a perfect recall game is a perfect recall
refinement of itself. For $I \in \mathcal{I}_i$, we define

\[ \mathcal{P}(I) = \{ \breve{I} \mid \breve{I} \in \breve{\mathcal{I}}_i, \; \breve{I} \subseteq I \} \]

to be the set of all information sets in $\breve{\mathcal{I}}_i$ that are subsets of $I$. Note
that our notion of refinement is similar to the one described by Kaneko & Kline (1995). Our
definition differs in that we consider any possible refinement, whereas Kaneko & Kline consider
only the coarsest such refinement.

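Definition 2. For a game $\Gamma$ and a perfect recall refinement $\breve{\Gamma}$, we say that
$\Gamma$ is a well-formed game with respect to $\breve{\Gamma}$ if for all $i \in N$,
$I \in \mathcal{I}_i$, $\breve{I}, \breve{I}' \in \mathcal{P}(I)$, there exists a
bijection $\phi : Z_{\breve{I}} \to Z_{\breve{I}'}$ and constants
$k_{\breve{I},\breve{I}'}, \ell_{\breve{I},\breve{I}'} \in [0, \infty)$ such that for all
$z \in Z_{\breve{I}}$:

(i) $u_i(z) = k_{\breve{I},\breve{I}'} \, u_i(\phi(z))$,

(ii) $\pi_c(z) = \ell_{\breve{I},\breve{I}'} \, \pi_c(\phi(z))$,

(iii) In $\Gamma$, $X_{-i}(z) = X_{-i}(\phi(z))$, and

(iv) In $\Gamma$, $X_i(z[\breve{I}], z) = X_i(\phi(z)[\breve{I}'], \phi(z))$.

We say that $\Gamma$ is a well-formed game if it is well-formed with respect to some perfect
recall refinement.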

Recall that $Z_{\breve{I}}$ is the set of terminal histories containing a prefix in the
information set $\breve{I}$, and that $z[\breve{I}]$ is that prefix. Intuitively, a game is
well-formed if for each information set $I \in \mathcal{I}_i$, the structures around each
$\breve{I}, \breve{I}' \in \mathcal{P}(I)$ of some perfect recall refinement are isomorphic
across four conditions. Conditions (i) and (ii) state
that the corresponding utilities and chance frequencies at each terminal history are
proportional. Condition (iii) asserts that the opponents can never distinguish the
corresponding histories at any point in $\Gamma$. Finally, condition (iv) states that player
$i$ cannot distinguish between corresponding histories from $\breve{I}$ and $\breve{I}'$ until
the end of the game.

Consider again DRP as a perfect recall refinement of DRP-IR. In DRP, the available 
actions are independent of dice outcomes, and the final utilities are only dependent on 
the final sum of the players' dice. Therefore, in DRP the utilities are equivalent be- 
tween, for example, the terminal histories where player i rolled a 1 followed by a 5, 
and the terminal histories where player i rolled a 4 followed by a 2 (condition (i)). In 
addition, the chance probabilities of reaching each terminal history are equal (condi- 
tion (ii)). Furthermore, the opponents can never distinguish between two isomorphic 
histories since player $i$'s rolls are private (condition (iii)). Finally, in DRP-IR, player $i$
never remembers the outcome of the first roll from the second round on (condition (iv)).
Thus, DRP-IR is well-formed with respect to DRP, with constants
$k_{\breve{I},\breve{I}'} = \ell_{\breve{I},\breve{I}'} = 1$.

Any perfect recall game is well-formed with respect to itself since $\mathcal{P}(I) = \{I\}$,
$\phi$ equal to the identity bijection, and
$k_{\breve{I},\breve{I}'} = \ell_{\breve{I},\breve{I}'} = 1$ satisfy Definition 2. However,
many imperfect recall games are also well-formed, with DRP-IR being one example.
An additional example is presented in Section 6.

We now show that CFR can be applied to any well-formed game to minimize aver- 
age regret. A sketch of the proof is described below, while a full proof is provided as 
supplementary material. 

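Theorem 1. If $\Gamma$ is well-formed with respect to $\breve{\Gamma}$, then the average regret
in $\breve{\Gamma}$ for player $i$ of choosing strategies according to CFR in $\Gamma$ is
bounded by

\[ \frac{\breve{R}_i^T}{T} \le \frac{\Delta_i K \sqrt{|A_i|}}{\sqrt{T}}, \]

where $K = \sum_{I \in \mathcal{I}_i} \max_{\breve{I}, \breve{I}' \in \mathcal{P}(I)} k_{\breve{I},\breve{I}'} \, \ell_{\breve{I},\breve{I}'}$.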

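Proof sketch. One can show that conditions (i) to (iv) of Definition 2 imply that the
positive regrets are proportional between any two information sets in $\breve{\Gamma}$ that are
merged in the well-formed game, $\Gamma$. In other words, for all $I \in \mathcal{I}_i$,
$\breve{I}, \breve{I}' \in \mathcal{P}(I)$, and $a \in A(I)$,

\[ R_i^{T,+}(\breve{I}, a) = k_{\breve{I},\breve{I}'} \, \ell_{\breve{I},\breve{I}'} \, R_i^{T,+}(\breve{I}', a). \]

Since regrets between $\Gamma$ and $\breve{\Gamma}$ are additive, i.e.,

\[ R_i^T(I, a) = \sum_{\breve{I} \in \mathcal{P}(I)} R_i^T(\breve{I}, a) \quad \text{for all } I \in \mathcal{I}_i, \]

the proportionality implies that minimizing regret at each $I \in \mathcal{I}_i$ minimizes
regret at each $\breve{I} \in \breve{\mathcal{I}}_i$. Because $\breve{\Gamma}$ has perfect
recall, applying equation (3) gives the result. ■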

Since the strategy space is more expressive in $\breve{\Gamma}$ than in $\Gamma$
($\Sigma \subseteq \breve{\Sigma}$), $R_i^T \le \breve{R}_i^T$
and thus it immediately follows that the average regret in $\Gamma$ is minimized. In the case
when $\Gamma$ has perfect recall, because $\Gamma$ is well-formed with respect to itself,
Theorem 1 with $K = |\mathcal{I}_i|$ is a direct generalization of the original CFR bound in
equation (4). Theorem 1 not only guarantees regret minimization for perfect recall games, but
also for well-formed imperfect recall games.

5.2 Skew Well-formed Games 

We now present a generalization of well-formed games to which a regret bound can 
still be derived. 

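Definition 3. For a game $\Gamma$ and a perfect recall refinement $\breve{\Gamma}$, we say that
$\Gamma$ is a skew well-formed game with respect to $\breve{\Gamma}$ if for all $i \in N$,
$I \in \mathcal{I}_i$, $\breve{I}, \breve{I}' \in \mathcal{P}(I)$, there exists
a bijection $\phi : Z_{\breve{I}} \to Z_{\breve{I}'}$ and constants
$k_{\breve{I},\breve{I}'}, \delta_{\breve{I},\breve{I}'}, \ell_{\breve{I},\breve{I}'} \in [0, \infty)$
such that for all $z \in Z_{\breve{I}}$:

(i) $|u_i(z) - k_{\breve{I},\breve{I}'} \, u_i(\phi(z))| \le \delta_{\breve{I},\breve{I}'}$,

(ii) $\pi_c(z) = \ell_{\breve{I},\breve{I}'} \, \pi_c(\phi(z))$,

(iii) In $\Gamma$, $X_{-i}(z) = X_{-i}(\phi(z))$, and

(iv) In $\Gamma$, $X_i(z[\breve{I}], z) = X_i(\phi(z)[\breve{I}'], \phi(z))$.

We say that $\Gamma$ is a skew well-formed game if it is skew well-formed with respect to some
perfect recall refinement.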

The only difference between Definitions 2 and 3 is in condition (i). While utilities
must be exactly proportional in a well-formed game, utilities in a skew well-formed
game must only be proportional up to a constant $\delta_{\breve{I},\breve{I}'}$. Note that any
well-formed game is skew well-formed by setting $\delta_{\breve{I},\breve{I}'} = 0$.

For example, consider a new version of DRP called Skew-DRP($\delta$) with slightly
modified payouts at the end of the game. Whenever the game reaches a showdown,
player 1 receives a bonus of $\delta$ times the number of chips in the pot from player 2 if
player 1's second die roll was even; otherwise, no bonus is awarded. The pot is then
awarded to the player with the highest dice sum as usual. Analogously, define
Skew-DRP-IR($\delta$) to be the imperfect recall abstraction of Skew-DRP($\delta$) where in the
second round, players only remember the sum of their two dice. Now, Skew-DRP-IR($\delta$) is not
well-formed with respect to Skew-DRP($\delta$). To see this, note that the utilities resulting
from the rolls 1,5 and the rolls 4,2 and the same sequence of betting are not exactly
proportional because the second roll 5 is odd but 2 is even (utilities are off by $\delta$ times
the pot size). However, Skew-DRP-IR($\delta$) is skew well-formed with respect to
Skew-DRP($\delta$) with $\delta_{\breve{I},\breve{I}'} = \delta$ times the maximum pot size
attainable from $I$.
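
The gap in condition (i) can be checked by brute force on a simplified showdown. The Python sketch below assumes a stylized payoff (player 1 wins or loses the pot according to the dice sums, ties pay zero, plus the even-second-roll bonus); the tie rule, the pot sizes, and the function names are assumptions for illustration and are not taken from the paper.

```python
from itertools import product
from typing import Sequence, Tuple

Rolls = Tuple[int, int]


def showdown_utility(own: Rolls, opp: Rolls, pot: float, skew: float = 0.0) -> float:
    """Player 1's showdown payoff in a simplified (Skew-)DRP."""
    if sum(own) > sum(opp):
        base = pot
    elif sum(own) < sum(opp):
        base = -pot
    else:
        base = 0.0  # assumed tie rule
    bonus = skew * pot if own[1] % 2 == 0 else 0.0
    return base + bonus


def max_utility_gap(skew: float, pots: Sequence[float] = (2.0, 10.0)) -> float:
    """Largest |u(z) - u(phi(z))| over own roll pairs with equal sums (k = 1)."""
    rolls = list(product(range(1, 7), repeat=2))
    gap = 0.0
    for own_a, own_b in product(rolls, repeat=2):
        if sum(own_a) != sum(own_b):
            continue  # only histories merged by the abstraction are compared
        for opp in rolls:
            for pot in pots:
                diff = showdown_utility(own_a, opp, pot, skew) - showdown_utility(
                    own_b, opp, pot, skew
                )
                gap = max(gap, abs(diff))
    return gap
```

With zero skew the gap vanishes, matching the well-formed case; with a positive skew the gap equals the skew times the largest pot considered, matching the constant described above.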

Unfortunately, there is no guarantee that regret will be minimized by CFR in a 
skew well-formed game. However, we can still bound regret in a predictable manner 
according to the degree that the utilities are skewed: 

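Theorem 2. If $\Gamma$ is skew well-formed with respect to $\breve{\Gamma}$, then the average
regret in $\breve{\Gamma}$ for player $i$ of choosing strategies according to CFR in $\Gamma$
is bounded by

\[ \frac{\breve{R}_i^T}{T} \le \frac{\Delta_i K \sqrt{|A_i|}}{\sqrt{T}} + \sum_{I \in \mathcal{I}_i} |\mathcal{P}(I)| \, \delta_I, \]

where $K = \sum_{I \in \mathcal{I}_i} \max_{\breve{I}, \breve{I}' \in \mathcal{P}(I)} k_{\breve{I},\breve{I}'} \, \ell_{\breve{I},\breve{I}'}$
and $\delta_I = \max_{\breve{I}, \breve{I}' \in \mathcal{P}(I)} \delta_{\breve{I},\breve{I}'} \, \ell_{\breve{I},\breve{I}'}$.

The proof is similar to that of Theorem 1. Theorem 2 shows that as $T$ approaches
infinity, the bound on our regret approaches
$\sum_{I \in \mathcal{I}_i} |\mathcal{P}(I)| \, \delta_I$. Our experiments in Section 6
demonstrate that as the skew $\delta$ grows, so does our regret in Skew-DRP($\delta$) after a
fixed number of iterations.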
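
To see how the two terms of the bound behave, the following Python sketch evaluates it numerically, reading the bound as a vanishing term proportional to $1/\sqrt{T}$ plus a constant skew term. The tuple encoding of the per-information-set constants and the function name `regret_bound` are assumptions for this example; setting every skew constant to zero gives the Theorem 1 form.

```python
import math
from typing import Sequence, Tuple

# Per abstract information set I: (|P(I)|, max of k * l over P(I), delta_I).
InfoSetConstants = Tuple[int, float, float]


def regret_bound(
    delta_i: float,
    constants: Sequence[InfoSetConstants],
    max_actions: int,
    T: int,
) -> float:
    """Average-regret bound: a term shrinking with T plus a constant skew term."""
    K = sum(kl for _, kl, _ in constants)
    skew_floor = sum(size * d for size, _, d in constants)
    return delta_i * K * math.sqrt(max_actions) / math.sqrt(T) + skew_floor


# Hypothetical game with two abstract information sets for player i.
EXAMPLE = [(2, 1.0, 0.0), (3, 1.0, 0.1)]
```

As the number of iterations grows, `regret_bound` approaches the skew term alone, which is why the bound does not guarantee vanishing regret in a skew well-formed game.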

Remarks. Theorems 1 and 2 are, to our knowledge, the first to provide such
theoretical guarantees in imperfect recall settings. However, these results are also relevant
with regard to regret in the full game when CFR is applied to an abstraction. Recall
that if $\breve{\Gamma}$ has perfect recall, then $\breve{\Gamma}$ is a perfect recall
refinement of any (skew) well-formed abstract game. Thus, if we choose an abstraction that
yields a (skew) well-formed game, then applying CFR to the abstract game achieves a bound on
the average regret in the full game, $\breve{\Gamma}$. This is true regardless of whether the
abstraction exhibits perfect recall or imperfect recall. Previous counterexamples show that
abstraction in general provides no guarantees in the full game [Waugh et al., 2009a]. In
contrast, our results show that applying CFR to an abstract game leads to bounded regret in
the full game, provided we restrict ourselves to (skew) well-formed abstractions. If such an
abstract game is much smaller than the full game, a significant amount of memory is
saved when running CFR.

6 Empirical Evaluation 

To complement our theoretical results, we apply CFR to both players simultaneously in 
several zero-sum imperfect recall (abstract) games, and measure the sum of the average 
regrets for both players in a perfect recall refinement (the full game). Along with 
the small DRP domain and its variants, we also consider the challenging domains of 
phantom tic-tac-toe and Bluff, which we now describe. 

Phantom tic-tac-toe. As in regular tic-tac-toe, phantom tic-tac-toe (PTTT) is 
played on a 3-by-3 board, initially empty, where the goal is to claim three squares 
along the same row, column, or diagonal. However, in PTTT, players' actions are 
private. Each turn, a player attempts to take a square of their choice. If they fail due to 
the opponent having taken that square on a previous turn, the same player keeps trying 
to take an alternative square until they succeed. Players are not informed about how 
many attempts the opponent made before succeeding. The game ends immediately 
if there is ever a connecting line of squares belonging to the same player. The winner 
receives a payoff of +1, while the losing player receives -1. In PTTT, the total number
of histories $|H| \approx 10^{10}$.

Bluff. Bluff, also known as Liar's Dice, Dudo, and Perudo, is a dice-bidding game.
In our version, Bluff($D_1$, $D_2$), each die has six sides with faces 1 to 6. Each player $i$
rolls $D_i$ of these dice and looks at them without showing them to the opponent. Each
round, players alternate by bidding on the outcome of all dice in play until one player
claims that the other is bluffing (i.e., claims that the bid does not hold). A bid consists
of a quantity of dice and a face value. A face of 6 is considered "wild" and counts
as matching any other face. For example, the bid 2x5 represents the claim that there
are at least two dice with a face of 5 (or 6) among both players' dice. To place a new
bid, the player must increase either the quantity or face value of the current bid; in
addition, lowering the face is allowed if the quantity is increased. The player calling
bluff wins the round if the opponent's last bid is incorrect, and loses otherwise. The
losing player removes one of their dice from the game and a new round begins, starting
with the player who won the previous round. When a player has no more dice left, they
have lost the game. A utility of +1 is given for a win and -1 for a loss. In this paper,
we restrict ourselves to the case where $D_1 = D_2 = 2$. Note that since Bluff(2,2) is a
multi-round game, the expected values of Bluff(1,1) are precomputed for payoffs at the
leaves of Bluff(2,1), which is then solved for leaf payoffs in the full Bluff(2,2) game.
In Bluff(2,2), the total number of histories $|H| \approx 10^{10}$.
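
The rule for resolving a bluff call, with 6 treated as wild, can be sketched as follows. This is an illustrative reading of the rules stated above; the function names `bid_holds` and `caller_wins` are hypothetical.

```python
from typing import Sequence

WILD_FACE = 6  # a rolled 6 matches any bid face


def bid_holds(all_dice: Sequence[int], quantity: int, face: int) -> bool:
    """True if at least `quantity` dice show `face` or the wild face."""
    matches = sum(1 for d in all_dice if d == face or d == WILD_FACE)
    return matches >= quantity


def caller_wins(all_dice: Sequence[int], quantity: int, face: int) -> bool:
    """The player calling bluff wins the round when the last bid is incorrect."""
    return not bid_holds(all_dice, quantity, face)
```

For the bid 2x5 from the text, `all_dice` would hold the dice of both players, so a roll of 5 and 6 across the table is enough for the bid to hold.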
Table 1: PTTT and Bluff game sizes and properties.

Game     Abstr.    Well-formed    $|\mathcal{A}|$    Savings
DRP      None      Yes            2610               -
DRP      DRP-IR    Yes            860                67.05%
PTTT     None      Yes            11695314           -
PTTT     FOSF      Yes            9347010            20.08%
PTTT     FOI       No             1147530            90.19%
PTTT     FOS       No             1484168            87.31%
PTTT     FOE       No             47818              99.59%
Bluff    None      Yes            704643030          -
Bluff    r = 10    No             295534218          58.06%
Bluff    r = 8     No             108323418          84.63%
Bluff    r = 6     No             22518468           96.80%
Bluff    r = 4     No             2329068            99.67%
Bluff    r = 3     No             543900             99.92%
Bluff    r = 2     No             97608              99.97%
Bluff    r = 1     No             12600              99.99%
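
The Savings column appears to be the relative reduction in the number of information set, action pairs with respect to the unabstracted game. A short Python check under that assumption (the function name `savings` is hypothetical):

```python
def savings(abstract_size: int, full_size: int) -> float:
    """Percentage of information set, action pairs removed by an abstraction."""
    return 100.0 * (1.0 - abstract_size / full_size)


# Sizes copied from Table 1.
DRP_FULL, DRP_IR = 2610, 860
PTTT_FULL, PTTT_FOSF = 11695314, 9347010
BLUFF_FULL, BLUFF_R8 = 704643030, 108323418
```

Evaluating `savings(DRP_IR, DRP_FULL)` reproduces the DRP-IR entry of the table up to rounding, and the same holds for the FOSF and r = 8 rows.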



6.1 Results 

We consider several different imperfect recall abstractions for DRP, Skew-DRP($\delta$),
PTTT, and Bluff. For the DRP games, we apply DRP-IR and Skew-DRP-IR($\delta$)
respectively as described in Section 5. Our PTTT and Bluff experiments, however, also
investigate the effects of imperfect recall beyond skew well-formed games. In the full,
perfect recall version of PTTT, each player remembers the order of every failed and
every successful move she makes throughout the entire game. In our first abstract game,
FOSF, players forget the order of successive failures within the same turn. Clearly,
there is an isomorphism between any two merged information sets
$\breve{I}, \breve{I}' \in \mathcal{P}(I)$ since the order of the actions does not affect the
available future moves or utilities. Players still remember the turn in which each success and
each failure occurred, and so the opponent's sequences of actions must be equal across the
isomorphism. Thus, FOSF is well-formed. Our remaining PTTT abstractions, however, are not even
skew well-formed. In FOI, players independently remember the sequence of failures and the
sequence of successful actions, but not how the actions interleave. In FOS, players remember
the order of failed actions, but not the order of successes. Finally, in FOE, players only
know what actions they have taken and remember nothing about the order in which they
were taken. FOI, FOS, and FOE are not skew well-formed because no isomorphism
can preserve the order of the opponent's previous information set, action pairs
(breaking condition (iii) of Definitions 2 and 3). In Bluff, we use abstractions described by
Neller and Hnath (2011) that force players to forget everything except the last $r$ bids.
Similarly, these abstract games are not skew well-formed because the players forget
information that the opponent could previously distinguish. The size of each DRP, PTTT,
and Bluff game is given in Table 1. Here,
$\mathcal{A} = \{(I, a) : i \in N, I \in \mathcal{I}_i, a \in A(I)\}$ is
the set of all information set, action pairs. Note that Skew-DRP($\delta$) is the same size as
DRP regardless of the skew, and recall that CFR requires space linear in $|\mathcal{A}|$.

Figure 1: Sum of average regrets for both players as iterations increase for
Skew-DRP-IR($\delta$) (top), abstract games in PTTT (middle), and abstract games in Bluff
(bottom). Each graph uses a log scale on both axes. The vertical axes represent the sum of
average regret for both players in the corresponding full, unabstracted game, and horizontal
axes represent iterations.

For each game, we ran CFR¹ on both players, meaning that each player's
opponent was an identical copy of the same no-regret learner. The sum of the average
regrets for each player over number of iterations is shown in Figure 1. The
Skew-DRP-IR($\delta$) experiments show that as $\delta$ increases, so does the regret as
predicted by Theorem 2, though $\sum_{I \in \mathcal{I}_i} |\mathcal{P}(I)| \, \delta_I$ appears
to be a very loose bound on the final regret. In PTTT, regret diverges from zero for FOI, FOS,
and FOE, where FOS appears to provide slightly better strategies than FOI and FOE. While our
theory cannot explain why FOS performs better, this does match our intuition that remembering
information about the opponent's moves is important. For a small increase in average regret,
FOS reduces the space required by 87% compared to FOSF's 20% reduction. Note that for
both DRP and PTTT, running CFR on the full, perfect recall game achieves the same
regret as in the well-formed abstractions (Skew-DRP-IR(0) and FOSF) and is thus not
shown. In Bluff, we see that regret consistently worsens as fewer previous bids are
remembered. This suggests that a result similar to Theorem 2 for skew well-formed
games may hold if condition (iii) of Definition 2 is less constrained, though the proper
formulation for such a relaxation remains unclear. Nonetheless, choosing $r = 8$ saves
85% of the memory with only a very small increase in average regret after millions of
iterations.



7 Discussion 

Well-formed games are described by four conditions provided in Definition 2. Recall
that Koller & Megiddo (1992) prove that determining a player's guaranteed payoff in
an imperfect recall game is NP-complete. However, Koller & Megiddo's NP-hardness
reduction creates an imperfect recall game that breaks conditions (i), (iii), and (iv) of
Definition 2. In this section, we discuss the following question: For minimizing regret,
how important is it to satisfy each individual condition of Definition 2?

Skew well-formed games and Theorem 8 show that one can relax condition (i) 
of Definition 2 and still derive a bound on the average regret. In addition, most of 
our PTTT and Bluff abstractions from the previous section do not satisfy condition 
(iii), but CFR still produces reliable results. This suggests that it may be possible to 
relax condition (iii) in a similar manner to the relaxation of condition (i) introduced by 
skew well-formed games. While we leave this question open, we now demonstrate that 
breaking condition (iii) can lead CFR to a dead-lock situation where one player has 
constant average regret. 

Let us walk through the process of applying CFR to the game in Figure 2. Note 
that this game satisfies all of the conditions of Definition 2 except for condition (iii). 
To begin, the current strategy profile $\sigma^1$ is set to be uniform random at every 
information set. Under this profile, when player 1 is at $I_3$, each of the four histories 
is equally likely. Thus, $v_1(\sigma^1_{(I_3 \to l)}, I_3) = v_1(\sigma^1_{(I_3 \to r)}, I_3) = v_1(\sigma^1, I_3) = 0$, and so 
$r_1^1(I_3, l) = r_1^1(I_3, r) = 0$. In addition, under $\sigma^1$, the counterfactual value of the pass 



Similar to Zinkevich et al. (2008), we used the chance sampling variant of CFR. 

Figure 2: A zero-sum game with imperfect recall where CFR does not minimize 
average regret. The utilities for player 1 are given at the terminal histories, where 
$\epsilon \in (0, 1)$. Nodes connected by a bold, dashed curve are in the same information 
set for player 1 (player 2 has perfect information). 

(p) and continue (c) actions at both $I_1$ and $I_2$ is zero, and thus the immediate 
counterfactual regrets at $I_1$ and $I_2$ on iteration 1 are also zero. Player 2, however, 
has positive immediate counterfactual regret for passing (p) at histories $ac$ and $ec$ 
(to always receive $\epsilon$ utility) and for continuing (c) at $bc$ and $dc$ (to always avoid 
receiving $-\epsilon$ utility), and has negative immediate counterfactual regret for 
continuing at $ac$ and $ec$ and for passing at $bc$ and $dc$. Therefore, the next profile 
$\sigma^2$ still has player 1 playing uniformly random everywhere, but player 2 now always 
passes at $ac$ and $ec$, and always continues at $bc$ and $dc$. On the second iteration of 
CFR, the positive regrets for player 1 at $I_3$ remain the same because the histories 
$bcc$ and $dcc$ are equally likely. Also, player 2's positive regrets remain the same at 
all four histories in $H_2$. However, player 1's expected utility for continuing at $I_1$ 
or $I_2$ is now negative since player 2 now passes at $ac$ and $ec$. Thus, player 1 gains 
positive regret for passing at both $I_1$ and $I_2$. This leads us to the next profile 
$\sigma^3 = \{(I_1, p) = 1, (I_2, p) = 1, (ac, p) = 1, (bc, p) = 0, (dc, p) = 0, (ec, p) = 1, (I_3, l) = 0.5\}$. 
One can check that running CFR for more iterations yields $\sigma^t = \sigma^3$ for all $t > 3$. 
The average regret for playing this way will be constant and hence does not approach 
zero because player 1 would rather play $\sigma'_1 = \{(I_1, p) = 1, (I_2, p) = 0, (I_3, l) = 0\}$ 
and get $u_1(\sigma'_1, \sigma^3_{-1}) = (1 - \epsilon)/4 > u_1(\sigma^3)$ for $\epsilon \in (0, 1)$. A similar example can 
be constructed where condition (iii) holds, but chance's probabilities are not 
proportional (breaking condition (ii)). 

Despite the problem of breaking condition (iii), condition (iv) of Definition 2 can be 
relaxed. Rather than enforcing player i's future information to be the same across the 
bijection $\phi$, we only require that the corresponding subtrees be isomorphic, allowing 
player i to re-remember information that was previously forgotten. The details for 
this relaxation are in the supplementary material. However, it is not clear that this 
relaxation is possible in skew well-formed games, nor does it seem to provide any 
practical advantage. 

8 Conclusion 

We have provided the first set of theoretical guarantees for CFR in imperfect recall 
games. We defined well-formed and skew well-formed games and provided bounds 
on the average regret that results from applying CFR to such games. In addition, our 
theory shows that we can achieve low average regret in a full, perfect recall game when 
employing CFR on an abstract version of the game, provided the abstract game is skew 
well-formed (with or without imperfect recall). Our DRP experiments confirm these 
theoretical results, while our PTTT and Bluff experiments hint that it may be possible 
to still bound regret in other types of imperfect recall games. Future work will look to 
expand on the set of imperfect recall games to which CFR can be reliably applied. In 
particular, it may be possible to derive regret bounds for a new class of games where 
conditions (ii) and (iii) of Definition 2 are relaxed. 

Appendix A 

In this section, we will prove Theorems 1 and 2 of the main paper. Note that by the 
definition of counterfactual value, the regrets between $\Gamma$ and a perfect recall 
refinement $\breve{\Gamma}$ are additive; specifically, for $I \in \mathcal{I}_i$ in $\Gamma$, 

$$R_i^T(I, a) = \sum_{\breve{I} \in \mathcal{P}(I)} R_i^T(\breve{I}, a). \tag{5}$$

First, we provide a lemma that generalizes Theorem 4 of (Zinkevich et al., 2008) by 
showing that if the immediate counterfactual regrets of each $\breve{I} \in \mathcal{P}(I)$ are 
proportional up to some difference $D$, then the average regret can be bounded above: 

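Lemma A. Let $\breve{\Gamma}$ be a perfect recall refinement of a game $\Gamma$. If for all $I \in \mathcal{I}_i$, 
$\breve{I}, \breve{I}' \in \mathcal{P}(I)$, and $a \in A(I)$, there exist constants 
$C_{\breve{I},\breve{I}',a}, D_{\breve{I},\breve{I}',a} \in [0, \infty)$ such that 

$$\frac{1}{T} \left| R_i^{T,+}(\breve{I}, a) - C_{\breve{I},\breve{I}',a} R_i^{T,+}(\breve{I}', a) \right| \le D_{\breve{I},\breve{I}',a}, \tag{6}$$

then the average regret in $\breve{\Gamma}$ is bounded by 

$$\frac{R_i^T}{T} \le \frac{\Delta_i C \sqrt{|A_i|}}{\sqrt{T}} + \sum_{I \in \mathcal{I}_i} |\mathcal{P}(I)| D_I,$$

where 

$$C = \sum_{I \in \mathcal{I}_i} \max_{\breve{I}, \breve{I}' \in \mathcal{P}(I),\, a \in A(I)} C_{\breve{I},\breve{I}',a}$$

and 

$$D_I = \max_{\breve{I}, \breve{I}' \in \mathcal{P}(I),\, a \in A(I)} D_{\breve{I},\breve{I}',a}.$$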

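Proof. 

$$
\begin{aligned}
R_i^T &\le \sum_{\breve{I} \in \breve{\mathcal{I}}_i} \max_{a \in A(\breve{I})} R_i^{T,+}(\breve{I}, a)
  && \text{by Theorem 3 of (Zinkevich et al., 2008)} \\
&= \sum_{I \in \mathcal{I}_i} \sum_{\breve{I} \in \mathcal{P}(I)} \max_{a \in A(I)} R_i^{T,+}(\breve{I}, a)
  && \text{by definition of a perfect recall refinement} \\
&\le \sum_{I \in \mathcal{I}_i} |\mathcal{P}(I)| \, R_i^{T,+}(\breve{I}^*, a^*) \\
&\le \sum_{I \in \mathcal{I}_i} |\mathcal{P}(I)| \left( C_{\breve{I}^*,\breve{I}^{**},a^*} R_i^{T,+}(\breve{I}^{**}, a^*) + T D_{\breve{I}^*,\breve{I}^{**},a^*} \right)
  && \text{by (6)} \\
&\le \sum_{I \in \mathcal{I}_i} |\mathcal{P}(I)| \, C_{\breve{I}^*,\breve{I}^{**},a^*} \left( \frac{1}{|\mathcal{P}(I)|} \sum_{\breve{I} \in \mathcal{P}(I)} R_i^T(\breve{I}, a^*) \right)^{+} + T \sum_{I \in \mathcal{I}_i} |\mathcal{P}(I)| D_I \\
&= \sum_{I \in \mathcal{I}_i} C_{\breve{I}^*,\breve{I}^{**},a^*} R_i^{T,+}(I, a^*) + T \sum_{I \in \mathcal{I}_i} |\mathcal{P}(I)| D_I
  && \text{by (5)} \\
&\le \sum_{I \in \mathcal{I}_i} C_{\breve{I}^*,\breve{I}^{**},a^*} \sqrt{\sum_{a \in A(I)} \left( R_i^{T,+}(I, a) \right)^2} + T \sum_{I \in \mathcal{I}_i} |\mathcal{P}(I)| D_I \\
&\le \sum_{I \in \mathcal{I}_i} C_{\breve{I}^*,\breve{I}^{**},a^*} \Delta_i \sqrt{|A(I)| T} + T \sum_{I \in \mathcal{I}_i} |\mathcal{P}(I)| D_I
  && \text{by Theorem 6 of (Lanctot et al., 2009)} \\
&\le \Delta_i C \sqrt{|A_i| T} + T \sum_{I \in \mathcal{I}_i} |\mathcal{P}(I)| D_I.
\end{aligned}
$$

In the third line, $\breve{I}^* = \arg\max_{\breve{I} \in \mathcal{P}(I)} \max_{a \in A(I)} R_i^{T,+}(\breve{I}, a)$ and 
$a^* = \arg\max_{a \in A(I)} R_i^{T,+}(\breve{I}^*, a)$. In the fourth line, 
$\breve{I}^{**} = \arg\min_{\breve{I} \in \mathcal{P}(I)} R_i^T(\breve{I}, a^*)$. The fifth line holds because the 
minimum is less than the average and $(\cdot)^+$ is monotone increasing. 

Dividing both sides by T establishes the lemma. ■ 

Note that if $\Gamma$ has perfect recall, then the constants $C_{\breve{I},\breve{I}',a} = 1$ and 
$D_{\breve{I},\breve{I}',a} = 0$ for all $I \in \mathcal{I}_i$ and $a \in A(I)$ satisfy the condition of Lemma A. In this 
case, $C = |\mathcal{I}_i|$ and $D_I = 0$, and so $R_i^T / T \le \Delta_i |\mathcal{I}_i| \sqrt{|A_i|} / \sqrt{T}$, recovering 
Theorem 4 of (Zinkevich et al., 2008). 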
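
As a numerical illustration of the bound in Lemma A, the following sketch evaluates its two terms separately. The function name `lemma_a_bound`, the dictionary layout, and all constants are hypothetical; they are not taken from the experiments.

```python
import math


def lemma_a_bound(delta_i, c_max, d_max, partition_size, max_actions, T):
    """Evaluate Delta_i * C * sqrt(|A_i|) / sqrt(T) + sum_I |P(I)| * D_I.

    For each abstract information set I, c_max[I] and d_max[I] hold the largest
    C and D constants over P(I), and partition_size[I] holds |P(I)|.
    """
    C = sum(c_max.values())
    vanishing = delta_i * C * math.sqrt(max_actions) / math.sqrt(T)
    persistent = sum(partition_size[I] * d_max[I] for I in d_max)
    return vanishing + persistent


# Perfect recall: every C is 1, every D is 0, and every partition is a singleton.
perfect = lemma_a_bound(1.0, {"I1": 1.0, "I2": 1.0}, {"I1": 0.0, "I2": 0.0},
                        {"I1": 1, "I2": 1}, max_actions=4, T=100)

# Hypothetical imperfect recall constants: one information set has D = 0.01.
skewed = lemma_a_bound(1.0, {"I1": 1.0, "I2": 1.0}, {"I1": 0.01, "I2": 0.0},
                       {"I1": 3, "I2": 1}, max_actions=4, T=100)
```

With these made-up numbers the first call reduces to $\Delta_i |\mathcal{I}_i| \sqrt{|A_i|} / \sqrt{T}$, while the second carries an extra term that does not shrink as T grows.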

We now use Lemma A to prove Theorems 1 and 2: 

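Theorem 2. If $\Gamma$ is skew well-formed with respect to $\breve{\Gamma}$, then the average regret in $\breve{\Gamma}$ 
for player i of choosing strategies according to CFR in $\Gamma$ is bounded by 

$$\frac{R_i^T}{T} \le \frac{\Delta_i K \sqrt{|A_i|}}{\sqrt{T}} + \sum_{I \in \mathcal{I}_i} |\mathcal{P}(I)| \delta_I,$$

where $K = \sum_{I \in \mathcal{I}_i} \max_{\breve{I}, \breve{I}' \in \mathcal{P}(I)} k_{\breve{I},\breve{I}'} \ell_{\breve{I},\breve{I}'}$ and 
$\delta_I = \max_{\breve{I}, \breve{I}' \in \mathcal{P}(I)} \delta_{\breve{I},\breve{I}'} \ell_{\breve{I},\breve{I}'}$. 

Proof. We will show that for all $I \in \mathcal{I}_i$, $\breve{I}, \breve{I}' \in \mathcal{P}(I)$, and $a \in A(I)$, 

$$\frac{1}{T} \left| R_i^{T,+}(\breve{I}, a) - k_{\breve{I},\breve{I}'} \ell_{\breve{I},\breve{I}'} R_i^{T,+}(\breve{I}', a) \right| \le \delta_{\breve{I},\breve{I}'} \ell_{\breve{I},\breve{I}'}, \tag{7}$$

which, by Lemma A, proves the theorem. 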

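Fix $I \in \mathcal{I}_i$, $\breve{I}, \breve{I}' \in \mathcal{P}(I)$, and $a \in A(I)$. Firstly, for all $z \in Z_{\breve{I}}$ and 
$\sigma \in \Sigma$, by conditions (ii) and (iii) of Definition 3, we have 

$$
\begin{aligned}
\pi^\sigma_{-i}(z) &= \pi_c(z) \prod_{(J,a) \in X_{-i}(z)} \sigma(J, a) \\
&= \ell_{\breve{I},\breve{I}'} \pi_c(\phi(z)) \prod_{(J,a) \in X_{-i}(\phi(z))} \sigma(J, a) \\
&= \ell_{\breve{I},\breve{I}'} \pi^\sigma_{-i}(\phi(z))
\end{aligned}
\tag{8}
$$

and by condition (iv) of Definition 3, we similarly have 

$$\pi^\sigma_i(z[\breve{I}], z) = \pi^\sigma_i(\phi(z)[\breve{I}'], \phi(z)) \tag{9}$$

and 

$$\pi^\sigma_i(z[\breve{I}]a, z) = \pi^\sigma_i(\phi(z)[\breve{I}']a, \phi(z)). \tag{10}$$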

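We can then bound the positive part of the immediate counterfactual regret 
$R_i^{T,+}(\breve{I}, a)$ above. Writing $k = k_{\breve{I},\breve{I}'}$, $\ell = \ell_{\breve{I},\breve{I}'}$, and $\delta = \delta_{\breve{I},\breve{I}'}$ for brevity, 

$$
\begin{aligned}
R_i^{T,+}(\breve{I}, a)
&= \left( \sum_{t=1}^{T} \sum_{z \in Z_{\breve{I}}} \pi^{\sigma^t}_{-i}(z) \left( \pi^{\sigma^t}_i(z[\breve{I}]a, z) - \pi^{\sigma^t}_i(z[\breve{I}], z) \right) u_i(z) \right)^{+} \\
&\le \left( \sum_{t=1}^{T} \sum_{z \in Z_{\breve{I}}} \ell \, \pi^{\sigma^t}_{-i}(\phi(z)) \left( \pi^{\sigma^t}_i(\phi(z)[\breve{I}']a, \phi(z)) - \pi^{\sigma^t}_i(\phi(z)[\breve{I}'], \phi(z)) \right) \left( k \, u_i(\phi(z)) + \delta \right) \right)^{+} \\
&\qquad \text{by equations (8), (9), (10), and condition (i) of Definition 3} \\
&= \left( \sum_{t=1}^{T} \sum_{z \in Z_{\breve{I}'}} \ell \, \pi^{\sigma^t}_{-i}(z) \left( \pi^{\sigma^t}_i(z[\breve{I}']a, z) - \pi^{\sigma^t}_i(z[\breve{I}'], z) \right) \left( k \, u_i(z) + \delta \right) \right)^{+} \\
&\qquad \text{since } \phi \text{ is a bijection} \\
&\le \left( \sum_{t=1}^{T} \sum_{z \in Z_{\breve{I}'}} k \ell \, \pi^{\sigma^t}_{-i}(z) \left( \pi^{\sigma^t}_i(z[\breve{I}']a, z) - \pi^{\sigma^t}_i(z[\breve{I}'], z) \right) u_i(z) \right)^{+} \\
&\qquad + \left( \sum_{t=1}^{T} \sum_{z \in Z_{\breve{I}'}} \ell \delta \, \pi^{\sigma^t}_{-i}(z) \left( \pi^{\sigma^t}_i(z[\breve{I}']a, z) - \pi^{\sigma^t}_i(z[\breve{I}'], z) \right) \right)^{+} \\
&\le k \ell \, R_i^{T,+}(\breve{I}', a) + T \delta \ell,
\end{aligned}
\tag{11}
$$

where the last line follows because $\pi^{\sigma^t}_{-i}(\breve{I}') = \sum_{z \in Z_{\breve{I}'}} \pi^{\sigma^t}_{-i}(z[\breve{I}']) \le 1$ in a perfect 
recall game $\breve{\Gamma}$. Similarly, 

$$R_i^{T,+}(\breve{I}, a) \ge k_{\breve{I},\breve{I}'} \ell_{\breve{I},\breve{I}'} R_i^{T,+}(\breve{I}', a) - T \delta_{\breve{I},\breve{I}'} \ell_{\breve{I},\breve{I}'}, \tag{12}$$

which together with equation (11) and dividing by T establishes (7), completing the 
proof. ■ 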

Note that Theorem 1 immediately follows from Theorem 2 since a well-formed game 
is skew well-formed with $\delta_{\breve{I},\breve{I}'} = 0$ for all $\breve{I}, \breve{I}' \in \mathcal{P}(I)$. 
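
The constants $K$ and $\delta_I$ of Theorem 2 are simple aggregates of the pairwise constants, which the sketch below makes concrete. The function name `skew_constants`, the input layout, and the numbers are hypothetical. Setting every pairwise $\delta$ to zero gives the well-formed case of Theorem 1.

```python
def skew_constants(pairs):
    """Aggregate pairwise constants into K and the per-information-set delta_I.

    pairs[I] is a list of (k, l, delta) triples, one for each ordered pair of
    refined information sets in P(I).
    """
    # K sums, over abstract information sets, the largest product k * l.
    K = sum(max(k * l for k, l, _ in triples) for triples in pairs.values())
    # delta_I is the largest product delta * l within P(I).
    delta = {I: max(d * l for _, l, d in triples) for I, triples in pairs.items()}
    return K, delta


# Hypothetical constants: "I1" merges two refined information sets with skew.
skew_K, skew_delta = skew_constants({
    "I1": [(1.0, 1.0, 0.0), (2.0, 0.5, 0.25)],
    "I2": [(1.0, 1.0, 0.0)],
})

# The same game with all pairwise delta set to zero (well-formed case).
wf_K, wf_delta = skew_constants({
    "I1": [(1.0, 1.0, 0.0), (2.0, 0.5, 0.0)],
    "I2": [(1.0, 1.0, 0.0)],
})
```

In the second call every $\delta_I$ is zero, so the additive term of the bound disappears and only the $O(1/\sqrt{T})$ term remains.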

Appendix B 

In this section, we consider an alternative extension of well-formed games that relaxes 
condition (iv) of Definition 2. For a subset of histories $S \subseteq H_i$, define 

$$D_i(S) = \{ I \mid I \in \mathcal{I}_i, \ \exists h \in S, \ h' \in I \text{ such that } h \sqsubseteq h' \}$$

to be the set of all information sets descending from any history in S. 

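Definition 4. For a game $\Gamma$ and a perfect recall refinement $\breve{\Gamma}$, we say that $\Gamma$ is a nearly 
well-formed game with respect to $\breve{\Gamma}$ if for all $i \in N$, $I \in \mathcal{I}_i$, $\breve{I}, \breve{I}' \in \mathcal{P}(I)$, $J \in D_i(\breve{I})$, 
there exist bijections $\phi : Z_{\breve{I}} \to Z_{\breve{I}'}$, $\psi : D_i(\breve{I}) \to D_i(\breve{I}')$, $\omega : A(J) \to A(\psi(J))$ and 
constants $k_{\breve{I},\breve{I}'}, \ell_{\breve{I},\breve{I}'} \in [0, \infty)$ such that for all $z \in Z_{\breve{I}}$: 

(i) $u_i(z) = k_{\breve{I},\breve{I}'} u_i(\phi(z))$, 

(ii) $\pi_c(z) = \ell_{\breve{I},\breve{I}'} \pi_c(\phi(z))$, 

(iii) In $\Gamma$, $X_{-i}(z) = X_{-i}(\phi(z))$, and 

(iv) $X_i(z[\breve{I}], z) = (J_1, a_1), \ldots, (J_m, a_m)$ if and only if 
$X_i(\phi(z)[\breve{I}'], \phi(z)) = (\psi(J_1), \omega(a_1)), \ldots, (\psi(J_m), \omega(a_m))$. 

We say that $\Gamma$ is a nearly well-formed game if it is nearly well-formed with respect to 
some perfect recall refinement. 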

In a nearly well-formed game, condition (iv) says that player i may now remember 
information that was once forgotten, provided the descendants from $\breve{I}$ and $\breve{I}'$ are 
isomorphic across $\phi$. This relaxes the corresponding condition for a well-formed game 
where player i could never remember information once it was forgotten. Clearly, any 
well-formed game is nearly well-formed by choosing $\psi$ and $\omega$ to be the identity 
bijections. 

For example, consider a longer version of DRP, DRP-3, that consists of three bet- 
ting rounds instead of two where a third die is rolled at the beginning of round 3. We 
then define DRP-IR-3 to be the imperfect recall abstraction of DRP-3 where during 
round 2, players only know the sum of their two dice. In round 3, players once again 
know the outcome of each individual die roll, recovering information from the first 
round that was forgotten in the second. For instance, corresponding histories where 
player i's first two rolls were 1,5 and where her first two rolls were 4,2 will be in the 
same information set during round 2, but will be in different information sets in round 
3. However, betting is independent of dice rolls and utilities are only dependent on 
the final sum of the three dice. Therefore, the descendants from these histories are 
isomorphic across $\phi$ and thus DRP-IR-3 is nearly well-formed with respect to DRP-3. 
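
The forgetting and re-remembering in DRP-IR-3 can be sketched by writing down what a player observes about her own dice in each round. The function names are hypothetical, the betting history is omitted, and the use of six-sided dice is an assumption made for the example.

```python
from itertools import product

SIDES = 6  # assumption: six-sided dice


def round2_info(die1, die2):
    """Round 2 of DRP-IR-3: only the sum of the first two dice is known."""
    return die1 + die2


def round3_info(die1, die2, die3):
    """Round 3 of DRP-IR-3: each individual die roll is known again."""
    return (die1, die2, die3)


def final_sum(die1, die2, die3):
    """Utilities depend only on the final sum of the three dice."""
    return die1 + die2 + die3


# All ordered outcomes of the first two dice, and the round-2 observations they map to.
first_two_rolls = list(product(range(1, SIDES + 1), repeat=2))
round2_observations = {round2_info(a, b) for a, b in first_two_rolls}
```

Under these assumptions the rolls 1,5 and 4,2 give the same round-2 observation but different round-3 observations, while the final sum for any common third die is identical. The 36 ordered outcomes of the first two dice collapse to 11 distinct sums during round 2.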

CFR guarantees that the average regret is also minimized in nearly well-formed 
games: 

Theorem 3. If $\Gamma$ is nearly well-formed with respect to $\breve{\Gamma}$, then the average regret in $\breve{\Gamma}$ 
for player i of choosing strategies according to CFR in $\Gamma$ is bounded by 

$$\frac{R_i^T}{T} \le \frac{\Delta_i K \sqrt{|A_i|}}{\sqrt{T}},$$

where $K = \sum_{I \in \mathcal{I}_i} \max_{\breve{I}, \breve{I}' \in \mathcal{P}(I)} k_{\breve{I},\breve{I}'} \ell_{\breve{I},\breve{I}'}$. 

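Proof. Fix $I \in \mathcal{I}_i$, $\breve{I}, \breve{I}' \in \mathcal{P}(I)$, and $a \in A(I)$. By conditions (ii) and (iii) of 
Definition 4, equation (8) holds. 

Claim: $R_i^T(J, b) = k_{\breve{I},\breve{I}'} \ell_{\breve{I},\breve{I}'} R_i^T(\psi(J), \omega(b))$ for all $J \in D_i(\breve{I})$, $b \in A(J)$, $T \ge 0$. 

Provided the claim is true, we have 

$$
\begin{aligned}
\sigma^{T+1}(J, b)
&= \begin{cases}
\dfrac{R_i^{T,+}(J, b)}{\sum_{a \in A(J)} R_i^{T,+}(J, a)} & \text{if } \sum_{a \in A(J)} R_i^{T,+}(J, a) > 0 \\[2ex]
\dfrac{1}{|A(J)|} & \text{otherwise}
\end{cases} \\
&= \begin{cases}
\dfrac{k_{\breve{I},\breve{I}'} \ell_{\breve{I},\breve{I}'} R_i^{T,+}(\psi(J), \omega(b))}{\sum_{a \in A(J)} k_{\breve{I},\breve{I}'} \ell_{\breve{I},\breve{I}'} R_i^{T,+}(\psi(J), \omega(a))} & \text{if } \sum_{a \in A(J)} k_{\breve{I},\breve{I}'} \ell_{\breve{I},\breve{I}'} R_i^{T,+}(\psi(J), \omega(a)) > 0 \\[2ex]
\dfrac{1}{|A(\psi(J))|} & \text{otherwise}
\end{cases} \\
&\qquad \text{since } \omega \text{ is a bijection} \\
&= \sigma^{T+1}(\psi(J), \omega(b))
\end{aligned}
\tag{13}
$$

for all $J \in D_i(\breve{I})$, $b \in A(J)$, $T \ge 0$. Therefore, for $t \ge 1$, 

$$
\begin{aligned}
\pi^{\sigma^t}_i(z[\breve{I}], z) &= \prod_{(J,b) \in X_i(z[\breve{I}], z)} \sigma^t(J, b) \\
&= \prod_{(J,b) \in X_i(z[\breve{I}], z)} \sigma^t(\psi(J), \omega(b)) \\
&= \prod_{(J,b) \in X_i(\phi(z)[\breve{I}'], \phi(z))} \sigma^t(J, b) && \text{by condition (iv) of Definition 4} \\
&= \pi^{\sigma^t}_i(\phi(z)[\breve{I}'], \phi(z)),
\end{aligned}
$$

and thus equation (9) and similarly equation (10) hold for $\sigma = \sigma^t$. By following the 
proof of Theorem 2, we then have that equations (11) and (12) with $\delta_{\breve{I},\breve{I}'} = 0$ hold, and 
hence equation (7) with $\delta_{\breve{I},\breve{I}'} = 0$ holds. This establishes the theorem by Lemma A. 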
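
Equation (13) rests on the fact that regret matching is unchanged when all cumulative regrets are multiplied by the same positive constant and the actions are relabelled. A minimal sketch of that property follows; the action names, the relabelling `omega`, and the regret values are hypothetical, and `scale` stands in for a positive product $k_{\breve{I},\breve{I}'} \ell_{\breve{I},\breve{I}'}$.

```python
def regret_matching(cum_regret):
    """Map action -> cumulative regret to action -> probability."""
    positive = {a: max(r, 0.0) for a, r in cum_regret.items()}
    total = sum(positive.values())
    if total > 0.0:
        return {a: p / total for a, p in positive.items()}
    # No positive regret: fall back to the uniform strategy.
    return {a: 1.0 / len(cum_regret) for a in cum_regret}


# Hypothetical relabelling of the actions at J onto the actions at psi(J).
omega = {"fold": "f", "call": "c", "raise": "r"}
scale = 0.5  # stands in for k * l > 0

# Hypothetical cumulative regrets at psi(J); the regrets at J are a scaled copy.
regret_psi_J = {"f": -2.0, "c": 6.0, "r": 2.0}
regret_J = {b: scale * regret_psi_J[omega[b]] for b in omega}

sigma_J = regret_matching(regret_J)
sigma_psi_J = regret_matching(regret_psi_J)
```

Here `sigma_J[b]` equals `sigma_psi_J[omega[b]]` for every action `b`, which is the content of equation (13) in this small instance.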

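To complete the proof, we are left to show that the claim holds. We will do so by 
induction on T. The base case $T = 0$ holds since $R_i^0(I, a) = 0$ for all $I \in \mathcal{I}_i$, 
$a \in A(I)$. For the inductive step, assume that 
$R_i^{T-1}(J, b) = k_{\breve{I},\breve{I}'} \ell_{\breve{I},\breve{I}'} R_i^{T-1}(\psi(J), \omega(b))$ for all $J \in D_i(\breve{I})$, $b \in A(J)$. We will 
show that $R_i^T(J, b) = k_{\breve{I},\breve{I}'} \ell_{\breve{I},\breve{I}'} R_i^T(\psi(J), \omega(b))$ for all $J \in D_i(\breve{I})$, $b \in A(J)$. 

Fix $J \in D_i(\breve{I})$ and $b \in A(J)$. By equation (13), we have for all $z \in Z_J$, 

$$
\begin{aligned}
\pi^{\sigma^T}_i(z[J], z) &= \prod_{(J',b') \in X_i(z[J], z)} \sigma^T(J', b') \\
&= \prod_{(J',b') \in X_i(z[J], z)} \sigma^T(\psi(J'), \omega(b')) && \text{by equation (13)} \\
&= \prod_{(J',b') \in X_i(\phi(z)[\psi(J)], \phi(z))} \sigma^T(J', b') \\
&= \pi^{\sigma^T}_i(\phi(z)[\psi(J)], \phi(z)),
\end{aligned}
\tag{14}
$$

where the third equality follows by condition (iv) of Definition 4 since $X_i(z[J], z)$ 
is a subsequence (more precisely, a suffix) of $X_i(z[\breve{I}], z)$, and similarly 

$$\pi^{\sigma^T}_i(z[J]b, z) = \pi^{\sigma^T}_i(\phi(z)[\psi(J)]\omega(b), \phi(z)). \tag{15}$$

Now consider the counterfactual regret at time T, 

$$
\begin{aligned}
r_i^T(J, b) &= \sum_{z \in Z_J} \pi^{\sigma^T}_{-i}(z) \left( \pi^{\sigma^T}_i(z[J]b, z) - \pi^{\sigma^T}_i(z[J], z) \right) u_i(z) \\
&= \sum_{z \in Z_J} \ell_{\breve{I},\breve{I}'} \pi^{\sigma^T}_{-i}(\phi(z)) \left( \pi^{\sigma^T}_i(\phi(z)[\psi(J)]\omega(b), \phi(z)) - \pi^{\sigma^T}_i(\phi(z)[\psi(J)], \phi(z)) \right) k_{\breve{I},\breve{I}'} u_i(\phi(z)) \\
&\qquad \text{by equations (14), (15) and conditions (i), (ii), and (iii) of Definition 4} \\
&= k_{\breve{I},\breve{I}'} \ell_{\breve{I},\breve{I}'} r_i^T(\psi(J), \omega(b)).
\end{aligned}
$$

Finally, 

$$
\begin{aligned}
R_i^T(J, b) &= \sum_{t=1}^{T} r_i^t(J, b) \\
&= R_i^{T-1}(J, b) + r_i^T(J, b) \\
&= k_{\breve{I},\breve{I}'} \ell_{\breve{I},\breve{I}'} R_i^{T-1}(\psi(J), \omega(b)) + k_{\breve{I},\breve{I}'} \ell_{\breve{I},\breve{I}'} r_i^T(\psi(J), \omega(b)) \\
&\qquad \text{by the induction hypothesis and the above} \\
&= k_{\breve{I},\breve{I}'} \ell_{\breve{I},\breve{I}'} \sum_{t=1}^{T} r_i^t(\psi(J), \omega(b)) \\
&= k_{\breve{I},\breve{I}'} \ell_{\breve{I},\breve{I}'} R_i^T(\psi(J), \omega(b)),
\end{aligned}
$$

establishing the inductive step. This completes the proof. 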